
Deduplicate validation results yield from validate.validate()#1619

Merged
yarikoptic merged 1 commit into dandi:master from candleindark:unique-errors
Apr 25, 2025

Conversation

@candleindark
Member

@candleindark candleindark commented Apr 24, 2025

This PR closes #1617.

This solution is a bit awkward in my opinion, but it fixes the problem without an overhaul of how validation is done in dandi-cli.

@codecov

codecov bot commented Apr 24, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.43%. Comparing base (4f466a8) to head (1fb98b0).
Report is 72 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1619      +/-   ##
==========================================
+ Coverage   88.28%   88.43%   +0.15%     
==========================================
  Files          78       78              
  Lines       11038    11045       +7     
==========================================
+ Hits         9745     9768      +23     
+ Misses       1293     1277      -16     
Flag        Coverage            Δ
unittests   88.43% <100.00%>    +0.15% ⬆️


@candleindark candleindark requested a review from Copilot April 24, 2025 02:36
Contributor

Copilot AI left a comment


Pull Request Overview

This PR addresses the issue of duplicate validation results from validate.validate() by introducing deduplication logic.

  • Introduces df_results and df_result_ids to track unique ValidationResult objects.
  • Updates the loop to yield only unique validation errors.
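The mechanism can be sketched roughly as follows. The names df_results and df_result_ids come from the overview above; the exact loop shape is an assumption, not the verbatim diff:

```python
# Hypothetical sketch of identity-based deduplication.
# `df_result_ids` records the id() of every ValidationResult already yielded;
# `df_results` keeps a reference to each one so that its id() cannot be
# recycled by the garbage collector while iteration is still running.
def iter_unique(results):
    df_results = []
    df_result_ids = set()
    for result in results:
        if id(result) not in df_result_ids:
            df_results.append(result)
            df_result_ids.add(id(result))
            yield result
```

Deduplicating by object identity rather than equality is what makes this safe for result objects that may compare equal while still being distinct findings.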

@candleindark
Member Author

@yarikoptic Need to have this one approved quick. It is needed for #1599.

@yarikoptic
Member

"not sure", since this would just address the symptom, potentially hiding away some code inefficiency resulting in duplicate records. Did you look why we get duplicate records?

@candleindark
Member Author

> "not sure", since this would just address the symptom, potentially hiding away some code inefficiency resulting in duplicate records. Did you look why we get duplicate records?

There is some inefficiency, but it is not terrible, and it is not the result of a bug but somewhat by design. Let's look at the code without the changes in this PR.

By NOT terrible I mean that no validation is repeated and no duplicate ValidationResult objects are actually created. The duplication results from repeated inclusions of the same ValidationResult objects at

dandi-cli/dandi/validate.py

Lines 171 to 176 in 4f466a8

for df in find_dandi_files(
    p, dandiset_path=dandiset_path, allow_all=allow_any_path
):
    yield from df.get_validation_errors(
        schema_version=schema_version, devel_debug=devel_debug
    )

df here is a DandiFile object that represents a "file" in the dandiset. For a dandiset that is also a BIDS dataset, df can be a BIDSDatasetDescriptionAsset or BIDSAsset.

When df is a BIDSDatasetDescriptionAsset, its get_validation_errors() method returns all the ValidationResults pertaining to the dataset and all individual assets,

def get_validation_errors(
    self,
    schema_version: str | None = None,
    devel_debug: bool = False,
) -> list[ValidationResult]:
    self._validate()
    assert self._dataset_errors is not None
    if self._asset_errors is not None:
        return self._dataset_errors + [
            i for j in self._asset_errors.values() for i in j
        ]
    else:
        return self._dataset_errors

When df is a BIDSAsset, its get_validation_errors() method returns ValidationResults pertaining to the dataset and the particular asset through its attached BIDSDatasetDescriptionAsset object,

def get_validation_errors(
    self,
    schema_version: str | None = None,
    devel_debug: bool = False,
) -> list[ValidationResult]:
    return self.bids_dataset_description.get_asset_errors(self)

def get_asset_errors(self, asset: BIDSAsset) -> list[ValidationResult]:
    """:meta private:"""
    self._validate()
    errors: list[ValidationResult] = []
    if self._dataset_errors:
        errors.extend(self._dataset_errors)
    assert self._asset_errors is not None
    errors.extend(self._asset_errors[asset.bids_path])
    return errors
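Taken together, the two methods overlap: dataset-level errors are returned once through the BIDSDatasetDescriptionAsset and again for every BIDSAsset. A toy model of that overlap (class and variable names are illustrative, not dandi-cli code):

```python
# Toy model of the overlap described above; names are illustrative only.
class Err:
    def __init__(self, msg):
        self.msg = msg

dataset_errors = [Err("missing-license")]           # dataset-level results
asset_errors = {"sub-01/a.nii": [Err("bad-name")]}  # per-asset results

# What the description asset returns: dataset errors + all assets' errors
desc_results = dataset_errors + [e for errs in asset_errors.values() for e in errs]
# What each asset returns: the SAME dataset error objects + its own errors
asset_results = dataset_errors + asset_errors["sub-01/a.nii"]

combined = desc_results + asset_results
# The single "missing-license" object now occurs twice in `combined`,
# although it was created only once.
```

No validation runs twice and no object is duplicated; the same object is simply included in more than one returned list, which is exactly what identity-based deduplication in the caller addresses.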

The reason that there is no repetition in validation, and no re-creation of the same ValidationResult objects, is that all BIDSAsset objects share an instance of BIDSDatasetDescriptionAsset,

bids_dataset_description_ref: weakref.ref[BIDSDatasetDescriptionAsset]

and BIDSDatasetDescriptionAsset._validate() actually only does validation once in its lifetime:
def _validate(self) -> None:
    with self._lock:
        if self._dataset_errors is None:
            # Import here to avoid circular import
            from dandi.validate import validate_bids

            results = validate_bids(self.bids_root)
            self._dataset_errors: list[ValidationResult] = []
            self._asset_errors: dict[str, list[ValidationResult]] = defaultdict(
                list
            )
            # Don't apply eta-reduction to the lambda, as mypy needs to be
            # assured that defaultdict's argument takes no parameters.
            self._asset_metadata = defaultdict(
                lambda: BareAsset.model_construct()  # type: ignore[call-arg]
            )
            for result in results:
                if result.id in BIDS_ASSET_ERRORS:
                    assert result.path
                    bids_path = result.path.relative_to(self.bids_root).as_posix()
                    self._asset_errors[bids_path].append(result)
                elif result.id in BIDS_DATASET_ERRORS:
                    self._dataset_errors.append(result)
                elif result.id == "BIDS.MATCH":
                    assert result.path
                    bids_path = result.path.relative_to(self.bids_root).as_posix()
                    assert result.metadata is not None
                    self._asset_metadata[bids_path] = prepare_metadata(
                        result.metadata
                    )
                    self._bids_version = result.origin.standard_version
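The validate-once caching pattern used by _validate() can be illustrated in isolation (a toy class with assumed names, not dandi-cli code):

```python
import threading

# Minimal sketch of the "validate once, cache forever" pattern above.
# `run_count` tracks how often the expensive step actually runs.
class CachedValidator:
    def __init__(self):
        self._lock = threading.Lock()
        self._errors = None
        self.run_count = 0

    def _validate(self):
        with self._lock:
            if self._errors is None:      # only on the first call
                self.run_count += 1       # stands in for the real validation
                self._errors = ["E1", "E2"]

    def get_errors(self):
        self._validate()
        return self._errors
```

Every caller, whether the description asset or an individual asset, hits the same cached list after the first call, so validation itself is never repeated.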

In sum, the repetition is only in the inclusion of the same ValidationResult objects (in validate.validate()), and it results from how files in a dandiset are currently represented.

@yarikoptic yarikoptic added UX patch Increment the patch version when merged labels Apr 25, 2025
@yarikoptic yarikoptic merged commit a774355 into dandi:master Apr 25, 2025
29 of 30 checks passed
@github-actions

🚀 PR was released in 0.68.1 🚀


Labels

patch (Increment the patch version when merged), released, UX

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Duplicate errors in dandi validate output

3 participants